Source: https://www.kaggle.com/zynicide/wine-reviews/downloads/wine-reviews.zip/4
My dataset comprises almost 130,000 reviews of individual wines, organized by price, rating on a 0-100 point scale, nationality, type, year, taster, and winery of origin. Wine snobs annoy me, so I wanted to see if anything they have to say about quality holds water statistically.
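library(dplyr)    # provides %>%, mutate(), select(), group_by(), summarize()
library(ggplot2)  # provides ggplot() and geom_bar()
data <- read.csv(file="winemag-data-130k-v2.csv")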
data
Common assertions about wine include a relationship between price and quality, the statement that “X was a good year for Y wine,” and the idea that certain countries make better wines. I’m going to explore these relationships using this dataset.
First, the columns of the table.
colnames(data)
## [1] "X" "country"
## [3] "description" "designation"
## [5] "points" "price"
## [7] "province" "region_1"
## [9] "region_2" "taster_name"
## [11] "taster_twitter_handle" "title"
## [13] "variety" "winery"
These can be trimmed for clarity. X is just a row ID and is unnecessary here, while description and designation are free-text fields that are not used in this analysis.
data <- data %>% select(-X, -description, -designation)
This removes those three columns from the table.
Next, what makes a good wine according to the data? Computing the mean rating by country of origin is easy enough.
mean_score_nationality <- data %>% select(points, country) %>% group_by(country) %>% summarize(score = mean(points))
mean_score_nationality
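A plain mean can be swayed by countries that have only a handful of reviews, so it may help to report the review count next to the average and to sort the result. The following is a minimal sketch on a made-up toy data frame; the `toy` values and the `n_reviews` column name are assumptions for illustration and are not part of the dataset.

```r
library(dplyr)

# Toy stand-in for the wine data; values are invented for illustration only.
toy <- data.frame(
  country = c("A", "A", "A", "B", "C", "C"),
  points  = c(88, 90, 92, 95, 85, 87)
)

# Mean score and number of reviews per country, highest mean first.
toy_summary <- toy %>%
  group_by(country) %>%
  summarize(score = mean(points), n_reviews = n()) %>%
  arrange(desc(score))

toy_summary
```

In this toy example, country B tops the ranking on the strength of a single review, which is the kind of case worth checking before reading much into a country-level average.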
According to this, England produces the best wine on average, but a graphical aid would better display the differences between countries.
mean_score_nationality %>% ggplot(aes(x=country, y=score)) + geom_bar(stat="identity", width=.4)
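With several dozen countries on the x-axis, the labels tend to overlap and unsorted bars are hard to compare. One possible variation, sketched here on invented toy values rather than the real summary table, orders the bars by score and flips the axes so the country names read horizontally. The `toy_scores` data frame is hypothetical.

```r
library(ggplot2)

# Toy stand-in for mean_score_nationality; values are invented for illustration.
toy_scores <- data.frame(
  country = c("A", "B", "C"),
  score   = c(90, 95, 86)
)

# reorder() sorts countries by score; coord_flip() puts names on the vertical axis.
p <- ggplot(toy_scores, aes(x = reorder(country, score), y = score)) +
  geom_col(width = 0.4) +
  coord_flip() +
  labs(x = "country", y = "mean score")

p
```

`geom_col()` is ggplot2's shorthand for `geom_bar(stat = "identity")`, so the bars themselves are drawn the same way as in the plot above.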